Regularized Mutual Information Neural Estimation
With the variational lower bound of mutual information (MI), the estimation
of MI can be understood as an optimization task via stochastic gradient
descent. In this work, we start by showing how Mutual Information Neural
Estimator (MINE) searches for the optimal function that maximizes the
Donsker-Varadhan representation. With our synthetic dataset, we directly
observe the neural network outputs during the optimization to investigate why
MINE succeeds or fails: we discover the drifting phenomenon, in which the
constant term of the network output shifts throughout the optimization process,
and analyze the instability caused by the interaction between this drift and an
insufficient batch size. Next, through theoretical and experimental evidence,
we propose a novel lower bound that effectively regularizes the neural network
to alleviate the problems of MINE. We also introduce an averaging strategy that
produces an unbiased estimate by utilizing multiple batches to mitigate the
batch size limitation. Finally, we show that our regularization achieves
significant improvements in both discrete and continuous settings. Comment: 18 pages, 15 figures.
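The Donsker-Varadhan objective and the multi-batch averaging idea above can be sketched numerically. This is a minimal numpy illustration assuming critic outputs are already available; the function names, toy scores, and two-batch pooling are illustrative, not the paper's implementation:

```python
import numpy as np

def dv_lower_bound(t_joint, t_marginal):
    """Donsker-Varadhan lower bound on MI:
    E_{p(x,y)}[T] - log E_{p(x)p(y)}[exp(T)],
    evaluated on critic outputs T for joint and shuffled (marginal) pairs."""
    return t_joint.mean() - np.log(np.exp(t_marginal).mean())

def dv_multi_batch(joint_batches, marginal_batches):
    """Averaging sketch: pool exp(T) across several batches before taking
    the log, mitigating the small-batch bias of the log-mean-exp term."""
    num = np.mean([b.mean() for b in joint_batches])
    den = np.mean([np.exp(b).mean() for b in marginal_batches])
    return num - np.log(den)

rng = np.random.default_rng(0)
# Toy critic outputs: joint samples score higher than marginal ones.
t_joint = rng.normal(1.0, 0.1, size=512)
t_marginal = rng.normal(0.0, 0.1, size=512)
est = dv_lower_bound(t_joint, t_marginal)
est_mb = dv_multi_batch([t_joint[:256], t_joint[256:]],
                        [t_marginal[:256], t_marginal[256:]])
```

With well-separated critic outputs, the estimate is close to the gap between the two means; with equal-size batches, the pooled version coincides with the single-batch bound.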
Opening the Black Box of wav2vec Feature Encoder
Self-supervised models, namely, wav2vec and its variants, have shown
promising results in various downstream tasks in the speech domain. However,
their inner workings are poorly understood, calling for in-depth analyses on
what the model learns. In this paper, we concentrate on the convolutional
feature encoder where its latent space is often speculated to represent
discrete acoustic units. To analyze the embedding space in a reductive manner,
we feed synthesized audio signals, each a summation of simple sine waves.
Through extensive experiments, we conclude that various kinds of information are
embedded inside the feature encoder representations: (1) fundamental frequency,
(2) formants, and (3) amplitude, packed with (4) sufficient temporal detail.
Further, the information incorporated inside the latent representations is
analogous to spectrograms but with a fundamental difference: latent
representations construct a metric space so that closer representations imply
acoustic similarity.
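Probe signals of the kind described above, sums of simple sine waves, are easy to reproduce. A minimal sketch, where the sample rate, frequencies, and amplitudes are hypothetical stand-ins for a fundamental plus formant-like partials:

```python
import numpy as np

def sine_mixture(freqs_hz, amps, sr=16000, dur=0.5):
    """Synthesize a sum of sine waves as a controlled probe signal,
    then peak-normalize it."""
    t = np.arange(int(sr * dur)) / sr
    sig = sum(a * np.sin(2 * np.pi * f * t) for f, a in zip(freqs_hz, amps))
    return sig / np.max(np.abs(sig))

# Hypothetical probe: a 120 Hz "fundamental" with two weaker partials.
probe = sine_mixture([120, 720, 1240], [1.0, 0.4, 0.2])
```

Feeding such fully controlled inputs lets one vary fundamental frequency, partial placement, and amplitude independently when inspecting the encoder's latent space.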
A globally exponentially stable position observer for interior permanent magnet synchronous motors
The design of a position observer for the interior permanent magnet
synchronous motor is a challenging problem that, in spite of many research
efforts, remained open for a long time. In this paper we present the first
globally exponentially convergent solution to it, assuming that the saliency is
not too large. As expected in all observer tasks, a persistency of excitation
condition is imposed. Conditions on the operation of the motor, under which it
is verified, are given. In particular, it is shown that at rotor standstill,
when the system is not observable, it is possible to inject a probing signal to
enforce the persistent-excitation condition. The high performance of the
proposed observer, in both standstill and high-speed regions, is verified by an
extensive series of test runs on an experimental setup.
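For reference, the persistency-of-excitation condition invoked above is typically stated as follows (generic notation, not necessarily the paper's):

```latex
\exists\, T, \delta > 0 \;\; \text{such that} \;\;
\int_{t}^{t+T} \phi(s)\, \phi^{\top}(s)\, ds \;\succeq\; \delta I
\quad \forall\, t \ge 0,
```

where $\phi$ denotes the regressor vector associated with the observer error dynamics; the probing signal at standstill is injected precisely to make this integral uniformly positive definite.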
Unsupervised Speech Representation Pooling Using Vector Quantization
With the advent of general-purpose speech representations from large-scale
self-supervised models, applying a single model to multiple downstream tasks is
becoming a de-facto approach. However, the pooling problem remains: the length
of speech representations is inherently variable. Naive average pooling is
often used, even though it ignores the characteristics of speech, such as
phonemes of differing lengths. Hence, we design a novel pooling method to
squash acoustically similar representations via vector quantization, which does
not require additional training, unlike attention-based pooling. Further, we
evaluate various unsupervised pooling methods on various self-supervised
models. We gather diverse methods scattered around speech and text to evaluate
on various tasks: keyword spotting, speaker identification, intent
classification, and emotion recognition. Finally, we quantitatively and
qualitatively analyze our method, comparing it with supervised pooling methods.
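A training-free vector-quantization pooling step of the kind described can be sketched as follows; the codebook, Euclidean assignment, and run-averaging rule are assumptions for illustration, not the paper's exact design:

```python
import numpy as np

def vq_pool(frames, codebook):
    """Assign each frame to its nearest codebook vector, average the
    frames within each run of consecutive identical codes, then mean-pool
    the resulting segments. This squashes acoustically similar neighboring
    frames (e.g. a long phoneme) into a single vector, so long phonemes do
    not dominate the utterance vector as they would under naive averaging."""
    # Nearest-code assignment (squared Euclidean distance).
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = d.argmin(1)
    # Segment boundaries wherever the assigned code changes.
    boundaries = np.flatnonzero(np.diff(codes)) + 1
    segments = np.split(frames, boundaries)
    seg_means = np.stack([s.mean(0) for s in segments])
    return seg_means.mean(0)  # fixed-size utterance representation

rng = np.random.default_rng(1)
frames = rng.normal(size=(50, 8))   # hypothetical (time, dim) representations
codebook = rng.normal(size=(4, 8))  # hypothetical codebook
pooled = vq_pool(frames, codebook)
```

No parameters are learned here; only a codebook (e.g. from an existing quantizer) is required, which is what distinguishes this family of methods from attention-based pooling.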
Automatic Severity Assessment of Dysarthric speech by using Self-supervised Model with Multi-task Learning
Automatic assessment of dysarthric speech is essential for sustained
treatments and rehabilitation. However, obtaining atypical speech is
challenging, often leading to data scarcity issues. To tackle the problem, we
propose a novel automatic severity assessment method for dysarthric speech,
using the self-supervised model in conjunction with multi-task learning.
Wav2vec 2.0 XLS-R is jointly trained on two different tasks: severity level
classification and auxiliary automatic speech recognition (ASR). For the
baseline experiments, we employ hand-crafted features such as eGeMaps and
linguistic features, and SVM, MLP, and XGBoost classifiers. Explored on the
Korean dysarthric speech QoLT database, our model outperforms the traditional
baseline methods, with a 4.79% relative increase in classification accuracy. In
addition, the proposed model surpasses the model trained without the ASR head,
achieving a 10.09% relative improvement. Furthermore, we show how multi-task
learning affects severity classification performance by analyzing the latent
representations and the regularization effect.
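The joint objective can be sketched as a weighted sum of the severity cross-entropy and an auxiliary ASR loss; the weight `lam` and the numbers below are hypothetical, not values from the paper:

```python
import numpy as np

def multitask_loss(severity_logits, severity_label, asr_loss, lam=0.1):
    """Multi-task objective sketch: L_total = L_severity + lam * L_asr.
    The severity term is a numerically stable cross-entropy; the ASR term
    (e.g. a CTC loss) is taken as given here."""
    z = severity_logits - severity_logits.max()   # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum())
    ce = -log_probs[severity_label]
    return ce + lam * asr_loss

# Toy call: 3 severity classes, correct class 0, hypothetical ASR loss.
loss = multitask_loss(np.array([2.0, 0.5, -1.0]), 0, asr_loss=3.2)
```

The auxiliary term acts as a regularizer on the shared encoder: gradients from the ASR head shape the representations even though only the severity head is used at test time.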
Speech Intelligibility Assessment of Dysarthric Speech by using Goodness of Pronunciation with Uncertainty Quantification
This paper proposes an improved Goodness of Pronunciation (GoP) that utilizes
Uncertainty Quantification (UQ) for automatic speech intelligibility assessment
for dysarthric speech. Current GoP methods rely heavily on overconfident
neural-network predictions, which makes them unsuitable for assessing
dysarthric speech, given its significant acoustic differences from healthy
speech. To alleviate the problem, UQ techniques were applied to GoP by 1)
normalizing the phoneme prediction (entropy, margin, maxlogit, logit-margin)
and 2) modifying the scoring function (scaling, prior normalization). As a
result, prior-normalized maxlogit GoP achieves the best performance, with a
relative increase of 5.66%, 3.91%, and 23.65% compared to the baseline GoP for
English, Korean, and Tamil, respectively. Furthermore, phoneme analysis is
conducted to identify which phoneme scores significantly correlate with
intelligibility scores in each language. Comment: Accepted to Interspeech 2023.
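The prediction-normalization variants named above (entropy, margin, maxlogit) can be illustrated on a single frame's phoneme logits; these are generic textbook definitions, and the paper's exact scoring functions (e.g. the prior normalization) may differ:

```python
import numpy as np

def gop_scores(logits, canonical):
    """Uncertainty-aware goodness-of-pronunciation scores for the
    canonical phoneme: the raw maxlogit, the margin over the best
    competing phoneme, and the (negated) entropy of the prediction.
    Higher values indicate a more confident, better-matched phoneme."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    maxlogit = logits[canonical]
    margin = logits[canonical] - np.max(np.delete(logits, canonical))
    entropy = -(p * np.log(p)).sum()
    return {"maxlogit": maxlogit, "margin": margin, "neg_entropy": -entropy}

# Toy frame: 3 phoneme logits, canonical phoneme has index 0.
s = gop_scores(np.array([3.0, 1.0, 0.2]), canonical=0)
```

Unlike the plain posterior, the logit-based scores are not squashed toward 1 by the softmax, which is what makes them less overconfident on atypical speech.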
Learning with Noisy Labels by Efficient Transition Matrix Estimation to Combat Label Miscorrection
Recent studies on learning with noisy labels have shown remarkable
performance by exploiting a small clean dataset. In particular, model agnostic
meta-learning-based label correction methods further improve performance by
correcting noisy labels on the fly. However, there is no safeguard on the label
miscorrection, resulting in unavoidable performance degradation. Moreover,
every training step requires at least three back-propagations, significantly
slowing down the training speed. To mitigate these issues, we propose a robust
and efficient method that learns a label transition matrix on the fly.
Employing the transition matrix makes the classifier skeptical about all the
corrected samples, which alleviates the miscorrection issue. We also introduce
a two-head architecture to efficiently estimate the label transition matrix
every iteration within a single back-propagation, so that the estimated matrix
closely follows the shifting noise distribution induced by label correction.
Extensive experiments demonstrate that our approach shows the best performance
in training efficiency while having comparable or better accuracy than existing
methods. Comment: ECCV 2022.
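The role of the transition matrix can be made concrete. A minimal sketch with a hypothetical symmetric-noise matrix; the paper's two-head estimation procedure is not reproduced here:

```python
import numpy as np

def noisy_posterior(clean_probs, T):
    """Given a label transition matrix T with T[i, j] = P(observed=j | true=i),
    the distribution over observed (possibly mis-corrected) labels is
    p_noisy = T^T @ p_clean. Training against p_noisy keeps the classifier
    skeptical of corrected labels instead of trusting them outright."""
    return T.T @ clean_probs

# Hypothetical 3-class transition matrix: 10% symmetric label noise.
T = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])
p_clean = np.array([1.0, 0.0, 0.0])
p_noisy = noisy_posterior(p_clean, T)
```

Because the matrix spreads probability mass over the classes a corrected label could have come from, a single mis-correction no longer drives the classifier to a confident wrong answer.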
TiDAL: Learning Training Dynamics for Active Learning
Active learning (AL) aims to select the most useful data samples from an
unlabeled data pool and annotate them to expand the labeled dataset under a
limited budget. Especially, uncertainty-based methods choose the most uncertain
samples, which are known to be effective in improving model performance.
However, AL literature often overlooks training dynamics (TD), defined as the
ever-changing model behavior during optimization via stochastic gradient
descent, even though other areas of literature have empirically shown that TD
provides important clues for measuring the sample uncertainty. In this paper,
we propose a novel AL method, Training Dynamics for Active Learning (TiDAL),
which leverages the TD to quantify uncertainties of unlabeled data. Since
tracking the TD of all the large-scale unlabeled data is impractical, TiDAL
utilizes an additional prediction module that learns the TD of labeled data. To
further justify the design of TiDAL, we provide theoretical and empirical
evidence to argue the usefulness of leveraging TD for AL. Experimental results
show that our TiDAL achieves better or comparable performance on both balanced
and imbalanced benchmark datasets compared to state-of-the-art AL methods,
which estimate data uncertainty using only static information after model
training. Comment: ICCV 2023 Camera-Ready.
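One simple way training dynamics can signal uncertainty is sketched below: average a sample's predicted probabilities over epochs and take the entropy of the average. This is a hand-crafted proxy for intuition only; TiDAL itself learns the dynamics with a separate prediction module:

```python
import numpy as np

def td_uncertainty(prob_history):
    """prob_history: (epochs, classes) predicted probabilities for one
    sample across training. A sample whose prediction keeps flipping has
    a flatter epoch-averaged distribution, hence higher entropy."""
    mean_p = prob_history.mean(axis=0)
    return -(mean_p * np.log(mean_p + 1e-12)).sum()

# A consistently classified sample vs. one the model keeps changing its
# mind about (hypothetical 2-class probabilities over 3 epochs).
stable = np.array([[0.90, 0.10], [0.92, 0.08], [0.95, 0.05]])
flippy = np.array([[0.90, 0.10], [0.20, 0.80], [0.70, 0.30]])
u_stable = td_uncertainty(stable)
u_flippy = td_uncertainty(flippy)
```

A snapshot of the final model would score both samples similarly, which is exactly the static-information limitation the abstract points at.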
Reliable Decision from Multiple Subtasks through Threshold Optimization: Content Moderation in the Wild
Social media platforms struggle to protect users from harmful content through
content moderation. These platforms have recently leveraged machine learning
models to cope with the vast amount of user-generated content daily. Since
moderation policies vary depending on countries and types of products, it is
common to train and deploy the models per policy. However, this approach is
highly inefficient, especially when the policies change, requiring dataset
re-labeling and model re-training on the shifted data distribution. To
alleviate this cost inefficiency, social media platforms often employ
third-party content moderation services that provide prediction scores of
multiple subtasks, such as predicting the existence of underage personnel, rude
gestures, or weapons, instead of directly providing final moderation decisions.
However, making a reliable automated moderation decision from the prediction
scores of the multiple subtasks for a specific target policy has not been
widely explored yet. In this study, we formulate real-world scenarios of
content moderation and introduce a simple yet effective threshold optimization
method that searches the optimal thresholds of the multiple subtasks to make a
reliable moderation decision in a cost-effective way. Extensive experiments
demonstrate that our approach shows better performance in content moderation
compared to existing threshold optimization methods and heuristics. Comment: WSDM 2023 (Oral Presentation).
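The threshold-search idea can be sketched with a brute-force grid over per-subtask thresholds; the decision rule (flag when any subtask exceeds its threshold), the grid, and the objective are illustrative assumptions, not the paper's method:

```python
import numpy as np
from itertools import product

def best_thresholds(scores, labels, grid=(0.3, 0.5, 0.7)):
    """Search one threshold per subtask so that flagging content when ANY
    subtask score exceeds its threshold maximizes validation accuracy."""
    best, best_acc = None, -1.0
    for ths in product(grid, repeat=scores.shape[1]):
        pred = (scores > np.array(ths)).any(axis=1)
        acc = (pred == labels).mean()
        if acc > best_acc:
            best, best_acc = ths, acc
    return best, best_acc

rng = np.random.default_rng(2)
scores = rng.uniform(size=(100, 3))   # hypothetical subtask scores
labels = scores.max(axis=1) > 0.7     # toy ground-truth moderation labels
ths, acc = best_thresholds(scores, labels)
```

Because only thresholds are tuned, a policy change requires re-running this cheap search on a small labeled set rather than re-labeling data and re-training a model.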
Improved performance of polymer light-emitting diodes with nanocomposites
The characteristics of a hybrid polymer light-emitting diode (HPLED) with an active layer of poly[2-methoxy-5-(2-ethylhexoxy)-1,4-phenylenevinylene] blended with Au-capped TiO₂ nanocomposites are reported. Both the increased current in the active layer and the low turn-on voltage were attributed to the incorporation of Au-capped TiO₂ in the electroluminescent polymer. The maximal brightness of 11,630 cd/m² was observed in the HPLED with a 1:1 ratio of Au-capped TiO₂. The enhanced performance was attributed to roughness-assisted charge transport induced by the Au-capped TiO₂ nanocomposites in the active polymer.